(57) The present invention provides a system and method for compressing a data sequence comprising a
plurality of records, each record having a plurality of fields and each field being arranged to contain a data
item. The system comprises: comparison means which, for a current field within a current record other than the
first record in the data sequence, is arranged to compare the data item in the current field with the data item
in the corresponding field of a preceding record; and match indication means, responsive to a signal from the
comparison means indicating that the data item matches the data item in the corresponding field, for replacing
the current field data item by a token indicating the match. The comparison means is arranged to repetitively
perform the comparison process on a predetermined number of fields in a plurality of records of the data
sequence. Preferably the system is incorporated within a server computer, the server being arranged to output
the data records of the data sequence as compressed by the system for transfer over a network to a client
computer.

The technique of the present invention is a very quick algorithm, taking very little resource. It does not
prevent the use of more traditional compression techniques, and is simple. Further, it recognises the field
structure of the data and uses this as a method to achieve good compression. Its algorithm is not affected
by the host computer architecture nor that of the transport layers. Additionally, it can reduce the CPU
resources required at the client side, thereby improving performance above and beyond the data reductions.



[FIG. 2: flow diagram of the compression process. Boxes, with the step numbers used in the description:
receive query from client (200); perform query and retrieve the data records forming the query result from the
database (210); pass a data record to the compressing means (220); first record? (230); store current record in
send record (240); compare each field with the corresponding field in the immediately preceding record (250);
if fields do not match, store field in send record (260); if fields match, store token in send record (270);
send the send record to the client (280); set immediately preceding record to current record (290); all records
processed? (300); end (310).]



EP 0 789 309 A2 



Description 

Field of the Invention 

The present invention relates to the compression of structured data, in particular data sequences comprising a
plurality of records, each record having a plurality of fields and each field being arranged to contain a data item.

Background Information 

Such data sequences are used widely in computer processing fields, as many computer applications involve the
creation and manipulation of structured data. For instance, such data sequences are used extensively in database
systems. Generally in such systems, there will be a database server computer arranged to manage the data within the
database. Client computers will connect to the server computer via a network in order to send database queries to the
server computer. The server will then process those queries, and pass the results back to the client. These results will
generally take the form of a structured data sequence of the type discussed above (ie having a plurality of records,
and each record having a plurality of fields with data items stored therein). For example, a database containing details
of a company's employees would typically have a data record for each employee. Each such data record would have
a number of fields for storing data such as name, age, sex, job description, etc. Within each field, there will be stored
a data item specific to the individual, for example, Mr Smith, 37, Male, Sales executive, etc. Hence a query performed
on that database will generally result in a data sequence being returned to the client which contains a number of
records, one for each employee meeting the requirements of the database query.

Since data storage is expensive, it is clearly desirable to minimise the amount of storage required to store structured
data. Additionally when a data sequence is copied or transferred between storage locations, it is desirable to minimise
the overhead in terms of CPU cycles, network usage, etc. Within the database field, much research has been carried
out into techniques for maintaining copies of data. Generally, these techniques are referred to as 'data replication'
techniques. The act of making a copy of data may result in a large sequence of data being transferred from a source
to a target, which as mentioned earlier is typically very costly in terms of CPU cycles, network usage, etc. Within the
database arena, this 'data replication' is often a repeated process with the copies being made at frequent intervals.
Hence, the overhead involved in making each copy is an important issue, and it is clearly advantageous to minimise
such overhead.

To reduce the volume of data needing to be transferred and the time required to copy a set of data, an area of
database technology called 'change propagation' has been developed. Change propagation involves identifying the
changes to one copy of a set of data, and forwarding only those changes to the locations where other copies of that
data set are stored. For example, if on Monday system B establishes a complete copy of a particular data set stored
on system A, then on Tuesday it will only be necessary to send system B a copy of the changes made to the original
data set stored on system A since the time on Monday that the copy was made. By such an approach, a copy can be
maintained without the need for a full refresh of the entire data set. However, even when employing change propagation
techniques, the set of changes from one copy to the other may be quite large, and hence the cost may still be significant.
Given the above problems, it is an object of the present invention to provide a technique for compressing structured
data which will alleviate the cost of maintaining and replicating structured data.

Summary of the Invention 

Accordingly the present invention provides a method of compressing a data sequence comprising a plurality of
records, each record having a plurality of fields and each field being arranged to contain a data item, the method
comprising the steps of: (a) for a current field within a current record other than the first record in the data sequence:
(i) comparing the data item in the current field with the data item in the corresponding field of a preceding record; (ii)
if the data item matches the data item in the corresponding field, replacing the current field data item by a token
indicating the match; and (b) repeating step (a) for a predetermined number of fields in a plurality of records of the data
sequence.

In preferred embodiments, the comparison step (a) is repeated for the predetermined number of fields in every
record of the data sequence. However, there may be instances where it is desired to only perform the comparison on
a subset of the records of the data sequence, and the invention is clearly applicable to such situations. Additionally,
the comparison step (a) is preferably performed for every field in the current record. However, in some situations, it
may be more efficient to apply some filtering such that some of the fields are not subjected to the comparison process.
This may for example be the case for fields which contain only a few characters, because in such cases, the
compression achievable may not warrant the time spent performing the compression process.

In preferred embodiments, at step (a) (i), the data item in the current field is compared with the data item in the
corresponding field of the immediately preceding record, and the method further comprises, subsequent to performing
the comparison step (a) for the predetermined number of fields in the current record, the step of storing the
uncompressed current record as the immediately preceding record for use in the performance of said step (a) (i) on the
predetermined number of fields of a subsequent record. This provides an efficient technique for performing the
comparison step, and avoids the need for the comparison means to retain information about more than one preceding data
record.

Further, in preferred embodiments, steps (a) and (b) are performed by a processor of a server computer, the
method comprising the further step of sending the data records of the data sequence as compressed by steps (a) and
(b) over a network to a client computer.

The token used in step (a) (ii) may take any appropriate form. However, to maximise compression, it is preferable
for the token to take the form of a predetermined single character, and in preferred embodiments, the token is the '.'
character. However, the token may be any appropriate character(s) which is/are recognisable as the token.

The present invention further provides a method of decompressing a data sequence compressed according to the
above described method, comprising the steps of: (A) for each field within a current record other than the first record
in the data sequence: determining whether the field contains the token; if the field does contain the token, replacing
the token by the data item in the corresponding field of a preceding record; and (B) repeating step (A) for a plurality of
records of the data sequence.

Viewed from a second aspect, the present invention provides a system for compressing a data sequence
comprising a plurality of records, each record having a plurality of fields and each field being arranged to contain a data
item, the system comprising: comparison means which, for a current field within a current record other than the first
record in the data sequence, is arranged to compare the data item in the current field with the data item in the
corresponding field of a preceding record; and match indication means, responsive to a signal from the comparison means
indicating that the data item matches the data item in the corresponding field, for replacing the current field data item
by a token indicating the match; the comparison means being arranged to repetitively perform the comparison process
on a predetermined number of fields in a plurality of records of the data sequence.

Brief Description of the Drawings

The present invention will be described further, by way of example only, with reference to a preferred embodiment
thereof as illustrated in the accompanying drawings, in which:

Figure 1 is a block diagram of a database server in accordance with a preferred embodiment of the present
invention; and
Figure 2 is a flow diagram illustrating the processing steps involved in compressing a data sequence in accordance
with the preferred embodiment of the present invention.

Description of the Preferred Embodiment



The preferred embodiment of the present invention will be described with reference to a database server arranged
to process database queries from client computers connectable to the server over a computer network. However, it
will be apparent that the present invention is applicable to any situation where it is desirable to represent a structured
data sequence in a compressed form, for example where storage space is limited or expensive.

In the preferred embodiment, we will consider the issue of transferring the results of a database query represented
as a data sequence using a text delimited system. The data sequence will contain a plurality of data records, each
data record representing one database entry and being referred to herein as a 'row' of data. Each row of data contains
a set of fields delimited from each other by some character (usually a comma) in a text format. This format is often
known as ASCII delimited in the personal computer arena. In such representations, number fields are often prefixed
by a +/- sign, strings are surrounded by quotes, and sometimes insignificant digits (trailing blanks, leading zeroes) are
dropped, for example:

"Smith","John",+26,"Vice-President","Lamborghini Countach"
"Jones","Alexander",+47,"Junior Under-Secretary","Ford Escort"

(where the fields represent surname, first name, age, job description, and car, respectively)

However, the exact representation of the 'rows' or data records is not relevant for the purposes of the present
invention; all that is required is that the compression system is able to identify the individual fields in the 'row'.
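As an illustration of field identification, the following minimal Python sketch splits one comma-delimited row into its data items. It is not part of the described system: the function name `split_fields` is hypothetical, and the sketch assumes rows are quoted in the usual ASCII delimited style with no space between a comma and the opening quote. It uses the standard `csv` module so that a comma inside a quoted string (as in a store address) is not mistaken for a field boundary.

```python
import csv


def split_fields(row: str) -> list[str]:
    """Return the data items of one comma-delimited, optionally quoted row."""
    # csv.reader treats a comma inside double quotes as part of the data item,
    # so only delimiting commas mark the end of a field.
    return next(csv.reader([row]))


row = '"Hardware   ",100000,"Unit 12, Raleigh Mall, Cary","North Carolina"'
fields = split_fields(row)
print(len(fields))  # 4
print(fields[2])    # Unit 12, Raleigh Mall, Cary
```

Any other way of locating field boundaries (fixed widths, a different delimiter) would serve equally well, since the technique only needs the fields to be identifiable.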

The system of the preferred embodiment will now be described in more detail with reference to figures 1 and 2.
When a user of a client computer 10 wishes to retrieve data from the database 20, he/she will construct a database
query defining the information required from the database, and will send that database query to the database server
30. As will be appreciated by those skilled in the art, the database server 30 will typically be provided on a network,
and will hence be able to receive requests from any client computers capable of accessing the database server 30
over that network.

Once the database server 30 has received the database query at step 200, it will employ a query processing means
40 to process the database query. At step 210, the query processing means will access the database 20 and retrieve
the data records matching the criteria set out in the database query issued by the client computer 10. As each row (ie
data record) is retrieved, that row is passed at step 220 through the compressing means 50. Alternatively, the server
30 may wait until all rows forming the query result have been retrieved, and then pass the rows through the compressing
means 50. The former approach is favoured in preferred embodiments since it enables the server to begin outputting
compressed data to the client 10 as soon as possible, possibly even before all the rows of the query result have been
retrieved by the query processing means 40.

Due to the structured nature of the data records retrieved from the database, the compressing means is able to
identify the individual fields within each data record. For instance in the preferred embodiment, the individual fields
are separated by a comma, and the compression means is arranged to identify the comma character and hence the
end of each field.

At step 230, it is determined whether the current record passed to the compressing means 50 is the first record
of the query result. If it is, then in preferred embodiments, this first record is passed through the compression means
50 unchanged. Hence, at step 240 the first record is stored as the 'send record', the send record being the record
which will be returned to the client 10.

Each subsequent record is passed one-by-one through the comparison means 60 of the compression means 50.
Hence, if at step 230 it is determined that the current record is not the first record, the comparison means 60 is arranged
such that, for each field in a current record, the comparison means compares the contents of that field with the contents
of the corresponding field in the immediately preceding data record (step 250). If the contents do not match, then at
step 260 the contents of the field are stored in a 'send record' representing the compressed form of the current record.
However, if the contents do match, then at step 270 the field of the current record is passed through the match indication
means 70, where the data for that field is replaced by a 'token' in the send record, this token indicating that the content
of that field is the same as the content in the corresponding field of the immediately preceding record. In preferred
embodiments, the token is chosen to be a single character such as a '.', since the use of a single character enables a
good compression to be achieved.

After all fields of the current record have been passed through the compressing means 50, the send record
representing the compressed form of the current record is passed to the output means 80 for transmission to the client
10 at step 280. The output means 80 may pass the send records one by one back to the client, or may wait until all
data records in the query result have been processed by the compressing means 50, and then send all of the send
records as a single file to the client 10.

Once the send record for a current record has been passed to the output means 80, then at step 290 the
uncompressed current record is stored as the 'immediately preceding record' for use by the comparison means 60 in the
comparison step 250 performed on the next data record.

Then, at step 300, it is determined whether all the records forming the query result have been processed, and if
they have, the compression process ends (step 310). Otherwise, the process returns to step 220, where the next record
is passed to the compressing means 50.
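The loop of steps 220 to 310 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the embodiment's actual code: the function name `send_records` is hypothetical, rows are plain comma-delimited strings with no embedded commas, every row has the same number of fields, and no data item is itself equal to the token. The step numbers in the comments refer to the description above.

```python
from typing import Iterable, Iterator


def send_records(rows: Iterable[str], token: str = ".",
                 delimiter: str = ",") -> Iterator[str]:
    """Yield the compressed 'send record' for each row, one at a time."""
    preceding = None                          # no immediately preceding record yet
    for row in rows:                          # step 220: next record to the compressor
        fields = row.split(delimiter)
        if preceding is None:                 # step 230: is this the first record?
            send = fields                     # step 240: first record sent unchanged
        else:
            send = [
                token if field == preceding[k] else field   # steps 250, 260, 270
                for k, field in enumerate(fields)
            ]
        yield delimiter.join(send)            # step 280: hand send record to output
        preceding = fields                    # step 290: keep the UNcompressed record
    # steps 300/310: the loop ends when all records have been processed


for line in send_records(["XXX,YYY,ZZZ", "AAA,YYY,BBB", "AAA,YYY,CCC"]):
    print(line)
```

Writing the compressor as a generator mirrors the favoured approach in which output can begin before the whole query result has been retrieved; only one preceding record is held in memory at any time.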

Once the client computer 10 has received the send records sent by the database server 30, these send records
can be readily decompressed by the client computer. The first record will not be compressed, and so needs no
processing by the client 10. The client would then review the second record for the presence of the token in any of the fields,
and for any fields having the token, the client would replace the token by the data item in the corresponding field of
the first record. Once this process had been completed for the second record, the client would keep a record of the
decompressed second record, and then review the third record. Again, any tokens identified would be replaced by the
data item in the corresponding field of the second record. Next, a record of the decompressed third record would be
kept, and the process would be repeated for the fourth record, etc.

As an example, consider the following records received by the client 10 as the first three records of the query result:

Record 1: XXX,YYY,ZZZ

Record 2: AAA,.,BBB

Record 3: .,.,CCC

Record 1 would be stored 'as is' by the client. Upon reviewing record 2, the token '.' would be identified and replaced
by the data item YYY to yield a decompressed second row of "AAA,YYY,BBB". This decompressed form would be
stored, and then the third record would be reviewed. Again, the two tokens would be identified and replaced by the
data items in the corresponding fields of the previous (ie second) record, yielding a decompressed third record of
"AAA,YYY,CCC". This process would be repeated for all the send records in the query result.
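The client-side procedure can be sketched in the same style. Again this is only an illustration: `restore_records` is a hypothetical name, and the sketch assumes comma-delimited rows of equal field count in which the token never occurs as genuine data.

```python
from typing import Iterable, Iterator


def restore_records(send_records: Iterable[str], token: str = ".",
                    delimiter: str = ",") -> Iterator[str]:
    """Yield the decompressed form of each send record, in order."""
    preceding = None
    for send in send_records:
        fields = send.split(delimiter)
        if preceding is not None:
            # Replace each token by the item in the same field of the
            # previously DEcompressed record.
            fields = [preceding[k] if field == token else field
                      for k, field in enumerate(fields)]
        yield delimiter.join(fields)
        preceding = fields                    # keep the decompressed record


received = ["XXX,YYY,ZZZ", "AAA,.,BBB", ".,.,CCC"]
print(list(restore_records(received)))
# ['XXX,YYY,ZZZ', 'AAA,YYY,BBB', 'AAA,YYY,CCC']
```

Keeping the decompressed (not the received) record as the reference is what allows a value to be carried forward across several consecutive rows, as with AAA and YYY in record 3.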

As will be apparent from the above description, the technique of the preferred embodiment involves removing
repeated fields and replacing them with a small representative token. It is based on an understanding of the fact that
the data is held in fields which are capable of being repeated, as opposed to treating the data sequence as one stream
of bytes.

This technique is best explained by a simple example. In this example, we will consider the situation of a database
query issued at a client machine resulting in a server machine returning an answer set for a report. Some form of
communications is assumed on a row-by-row basis, but the answer set could equally be returned as a data file etc.
The answer set may well have come from a 'join across multiple tables' process, but this is irrelevant for the purposes
of the present invention. The answer set used in this example is typical of those returned from a Data Warehouse or
Management Information System (MIS) application, and typical of a replication environment maintaining such a system.

Many fields in a database query/report are from a limited set of values. Examples include 'Job Type' (which may only
have certain values in a company, e.g. Manager, Clerk, etc.), Department, State, etc.

As an example, if a marketing report was trying to determine the types of goods sold in all retail stores across the
United States for a particular period, and to see if there was any geographical significance, the following query may
be issued:

SELECT GOODS, VALUE, STORE, STATE from whenever
This may result in data of the following form being returned to the client:

GOODS           INCOME    STORE                            STATE
(50 Chars)      (10)      (100 Chars)                      (40 Chars)

"Hardware   ",  100000,   "Unit 12, Raleigh Mall, Cary ",  "North Carolina"
"Software   ",  90000,    "Unit 12, Raleigh Mall, Cary ",  "North Carolina"
"Peripherals",  132000,   "Unit 12, Raleigh Mall, Cary ",  "North Carolina"
"Supplies   ",  64000,    "Unit 12, Raleigh Mall, Cary ",  "North Carolina"
"Magazines  ",  295000,   "Unit 12, Raleigh Mall, Cary ",  "North Carolina"

"Hardware   ",  12000,    "Pebble Mill Mall, Raleigh ",    "North Carolina"
"Software   ",  74000,    "Pebble Mill Mall, Raleigh ",    "North Carolina"
"Peripherals",  108000,   "Pebble Mill Mall, Raleigh ",    "North Carolina"
"Supplies   ",  77000,    "Pebble Mill Mall, Raleigh ",    "North Carolina"
"Magazines  ",  125000,   "Pebble Mill Mall, Raleigh ",    "North Carolina"

etc

Assuming there are:

a) Unique randomly distributed values for INCOME;

b) 5 classifications of GOODS (Hardware/Software/Peripherals/Supplies/Magazines); and

c) 40 STOREs in each of fifty STATEs (2000 stores in total)

this query would return 10,000 rows (ie 5 x 40 x 50). Given this number of rows, the returned data size would be equal
to two million characters (ie (50 + 10 + 100 + 40) * 10,000).

Using the technique of the preferred embodiment of the present invention, the higher the rate of repetition the
better the effect of the compression will be. Therefore data that is ordered by fields that are likely to repeat will gain
the most benefit.

If ordered by STATE, STORE, GOODS then each state would only be sent once in its full form, and each store
would be sent once per state in its full form. Hence, for the above illustrated example, the data sequence would be
reduced to the following form (where a '.' symbol is used as the token):

"Hardware   ", 100000, "Unit 12, Raleigh Mall, Cary ", "North Carolina"
"Software   ", 90000, ., .
"Peripherals", 132000, ., .
"Supplies   ", 64000, ., .
"Magazines  ", 295000, ., .

"Hardware   ", 12000, "Pebble Mill Mall, Raleigh ", .
"Software   ", 74000, ., .
"Peripherals", 108000, ., .
"Supplies   ", 77000, ., .
"Magazines  ", 125000, ., .

etc.

The returned data size is calculated as follows:

(50 rows * 40 chars) + (9950 rows * 1 char)            =  11950   STATE
((40 * 50) rows * 100 chars) + (8000 rows * 1 char)    = 208000   STORE
(10,000 rows * 10 chars)                               = 100000   INCOME
(10,000 rows * 50 chars)                               = 500000   GOODS
                                                         ------
                                                         819950



By comparison with the earlier figure for the situation in which the data is sent in uncompressed form, it is apparent
that use of the above technique results in a 60% reduction in transfer size.

These results are entirely dependent on the contents of the data. If, for example, there were 100 types of goods
instead of 5, the uncompressed data size would be 40 Million Characters (200,000 rows) and the compressed data
size would be 12,599,950 characters. This would be a reduction of nearly 70% and nearly 30Meg less data to transfer.
Equally there may be situations involving random data distribution where only a little improvement would be realised.
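The arithmetic for the ordered case can be checked with a short Python sketch. The function name `ordered_sizes` and its parameters are introduced here purely for illustration; the sketch assumes, as the text does, fixed field widths of 50, 10, 100 and 40 characters, a one-character token, unique INCOME values, and rows ordered by STATE, STORE, GOODS so that GOODS never repeats from one row to the next.

```python
def ordered_sizes(goods_types=5, stores_per_state=40, states=50,
                  widths=(50, 10, 100, 40), token_len=1):
    """Return (uncompressed, compressed) character counts for ordered data."""
    goods_w, income_w, store_w, state_w = widths
    stores = stores_per_state * states
    rows = goods_types * stores
    uncompressed = rows * sum(widths)
    # Each state is sent in full once, each store in full once; every other
    # occurrence of those two fields shrinks to a single token character.
    state = states * state_w + (rows - states) * token_len
    store = stores * store_w + (rows - stores) * token_len
    income = rows * income_w          # unique values: never tokenised
    goods = rows * goods_w            # changes on every row in this ordering
    return uncompressed, state + store + income + goods


for n in (5, 100):
    raw, packed = ordered_sizes(goods_types=n)
    print(n, raw, packed, f"{1 - packed / raw:.1%}")
# 5 2000000 819950 59.0%
# 100 40000000 12599950 68.5%
```

The computed reductions of 59.0% and 68.5% correspond to the rounded figures of 60% and nearly 70% quoted in the text.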
To demonstrate the fact that even in situations where the data is unordered, the small set of values relative to the
overall number of times used will still gain some benefit, we can use the same set of data as illustrated earlier, but
organise it so that the data is effectively completely random.

For the same query issued and ORDERED by INCOME (completely random distribution), then the following
situation arises.

GOODS: 1 in 5 rows can be assumed to be a repeat
STATE: 1 in 50 rows can be assumed to be a repeat
STORE: 1 in 2000 rows can be assumed to be a repeat
Hence, the following combinations of full and compressed format data will be achieved in the data sequence:

     FULL            REDUCED
( 8000 *  50)  +  (2000 * 1)  =   402000   GOODS
( 9800 *  40)  +  ( 200 * 1)  =   392200   STATE
( 9995 * 100)  +  (   5 * 1)  =   999505   STORE
(10000 *  10)                 =   100000   INCOME
                                 -------
                                 1893705   characters



This still amounts to a 5% reduction in transfer size. This is a significant saving even though the data is assumed
completely evenly and randomly spread. It would be rare for such an even spread, and data which is skewed will
probably have an improved effect on the compression.
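The random-order figures follow from the stated repeat rates, as the following Python sketch shows. The names `FIELDS` and `random_order_size` are hypothetical; the sketch simply assumes, as the text does, that a field with d equally likely values repeats the previous row's value in 1 of every d rows.

```python
ROWS = 10_000
# field name -> (width in characters, number of distinct values)
FIELDS = {"GOODS": (50, 5), "STATE": (40, 50), "STORE": (100, 2000)}


def random_order_size(rows=ROWS, fields=FIELDS, income_width=10, token_len=1):
    """Estimated transfer size when rows arrive in effectively random order."""
    total = rows * income_width            # INCOME is unique, never tokenised
    for width, distinct in fields.values():
        reduced = rows // distinct         # rows assumed to repeat the previous value
        total += (rows - reduced) * width + reduced * token_len
    return total


size = random_order_size()
print(size)                                # 1893705
print(f"{1 - size / 2_000_000:.1%}")       # 5.3%
```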

A lot of data will be skewed by nature. Take for example the state field in the earlier example. We have assumed
random distribution of this data across the fifty states. However, it is much more likely that the business in question is
successful on say the 'west-coast' with half of its business in California alone. With half of the stores in California and
a quarter in the next nearest 4 states, the data will be heavily skewed towards them, thereby increasing the hit rate
in repetitions substantially over that achieved for the random query.

Based on this data skew, the number of 'hits' in STATE would be calculated as follows:

Half of the stores, and hence half of the records, are in California, and those records have a one-in-two chance of
having California as the state for the next record. Hence, 2500 of those records (10000/2/2) will be compressable for
the STATE field. One quarter of the records (ie 2500) have a 1/16th chance of having the same state in the next record
(since 1/4 will be one of the four states, and 1/4 of those will be the same state). Hence 156 of those records (2500/16)
will be compressable for the STATE field. Finally, the remaining 1/4 of the records will be assumed to be evenly spread
across the remaining states, which will have a 1 in 45 chance of repeating. Hence 1/4 of the next records will have a
state which is one of the remaining states, and there is a 1/45 chance of that state being the same state, which gives
a 1/180 chance of having the same state in the next record. Hence, 14 of those records (2500/180) will be compressable
for the STATE field.

Adding the above figures up, 2670 records (2500+156+14) will have repeating states, and hence be compressable
for the state field. By using this figure in the above calculation performed for a completely random distribution, this
would give a transfer size of 1797375 characters (295870 characters for the state rather than 392200 in the random
case), or a reduction of 10%. This improvement from 5% to 10% is due entirely to the skew. No account is taken of
the fact that this would also skew the STORES and have further improvement.
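The skewed estimate can be reproduced with the following Python sketch, which follows the text's own probabilities and rounds each group of hits to a whole record as the text does. The function names are hypothetical, and the GOODS, STORE and INCOME sizes are carried over unchanged from the random-order calculation.

```python
def skewed_state_hits(rows=10_000):
    """STATE repeats under the skew described in the text."""
    california = rows / 2 * (1 / 2)     # half the records, 1-in-2 repeat chance
    next_four = rows / 4 * (1 / 16)     # a quarter of the records, 1/16 chance
    remainder = rows / 4 * (1 / 180)    # the last quarter, 1/180 chance
    return round(california) + round(next_four) + round(remainder)


def skewed_transfer_size(rows=10_000, state_width=40,
                         other_fields=402_000 + 999_505 + 100_000):
    """Total size with the skewed STATE field; other fields as in the random case."""
    hits = skewed_state_hits(rows)
    state = (rows - hits) * state_width + hits   # full values plus 1-char tokens
    return state + other_fields


print(skewed_state_hits())                               # 2670
print(skewed_transfer_size())                            # 1797375
print(f"{1 - skewed_transfer_size() / 2_000_000:.1%}")   # 10.1%
```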

When implementing the above described invention, a balance has to be found where the usual trade-off between 
CPU cycles and memory on the one hand versus transmission sizes on the other is not too severe as to impact overall 
performance. 

Some compression algorithms are very effective as data compressors but are very intensive as CPU operations 
and need the largest possible data size to get the maximum benefit. These algorithms usually build a dictionary of 
commonly used sets of characters. To be effective they need a set of data which is large enough to generate a good 
dictionary. 

In the examples given above, rows of 200 characters do not give much room for data compression algorithms of
that type to be effective. If you try blocking large numbers of rows to become more effective you have to trade off
memory for the blocking, extra cycles, non-busy line time, dynamic dictionary building, etc. These then become too
costly to be effective in trying to obtain the necessary low cost throughput.

The preferred embodiment of the present invention achieves excellent compression under certain circumstances 
by recognising the field structure of the data and looking for simple repetitions, thus allowing it to be used efficiently 
on a row by row basis. 

In a server implementation such as that described earlier, a field by field comparison is performed for any field
which may be appropriate. In order to ensure this is kept to a minimum any form of filtering may be employed. Ideally
this would be implemented efficiently with other relevant tasks such as type-validation.

A simple example may be to only check character fields with a length between, say, 10 and 1000 characters. This
assumes there is a certain minimum value not worth the reduction and a certain maximum value over which it is unlikely
to be a structured field containing repetitious data. A more complex alternative might be based on the statistics of the
database to determine fields which have a relatively low number of distinct values to occurrences for single table
queries. In fact, these statistics may already be used by the database query optimiser typically provided to determine
the best access path to retrieve the data, allowing a database query to have estimated the best candidates for repeating
fields.
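The simple length-based filter could look something like the following Python sketch. The `FieldInfo` record, the `candidate_fields` function and the classification of INCOME as a numeric field are all assumptions made for illustration; a real server would take this information from its own catalogue.

```python
from dataclasses import dataclass


@dataclass
class FieldInfo:
    name: str
    kind: str      # "char" for character fields, "num" for numeric ones (assumed labels)
    width: int     # declared length in characters


def candidate_fields(schema, min_width=10, max_width=1000):
    """Names of the fields worth passing through the comparison step."""
    return [f.name for f in schema
            if f.kind == "char" and min_width <= f.width <= max_width]


schema = [
    FieldInfo("GOODS", "char", 50),
    FieldInfo("INCOME", "num", 10),
    FieldInfo("STORE", "char", 100),
    FieldInfo("STATE", "char", 40),
]
print(candidate_fields(schema))   # ['GOODS', 'STORE', 'STATE']
```

The statistics-based alternative would replace the width test with a test on the ratio of distinct values to occurrences, if the database exposes such figures.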

As client CPU cycles are generally very important, the compression method should be very efficient. The algorithm
used in the preferred embodiment can actually REDUCE the resource required at the client side, if we assume a copy
from some form of communications structure to a structure usable by the client application:

From comms pipe             Structure to be handed to client app.

Comms_Structure.STORE   >   sqlca.STORE
Comms_Structure.STATE   >   sqlca.STATE
Comms_Structure.INCOME  >   sqlca.INCOME
Comms_Structure.GOODS   >   sqlca.GOODS

A simple check may save the client from having to do any field work:

if Comms_Structure.STORE(1) = '.'
    Do_Nothing    /* CLIENT DATA STRUCTURE ALREADY CONTAINS CORRECT, */
                  /* VALIDATED VALUE                                 */
else
    Do_whatever_Copy/Validation/Type-checking/Type-Conversion_you_like
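A hedged Python rendering of this check is given below. The function `apply_send_record`, the dictionary standing in for the client structure, and the per-field converter table are illustrative assumptions, not part of the described system. The point it shows is that a tokenised field costs the client nothing: the target structure still holds the validated value left there by the previous row.

```python
def apply_send_record(target, names, send_fields, convert, token="."):
    """Update `target` in place from one send record.

    Returns the number of fields that actually had to be copied/converted.
    """
    converted = 0
    for name, raw in zip(names, send_fields):
        if raw == token:
            continue                    # value from the previous row is still valid
        target[name] = convert[name](raw)   # copy / validate / type-convert
        converted += 1
    return converted


names = ["GOODS", "INCOME", "STORE", "STATE"]
convert = {"GOODS": str.strip, "INCOME": int, "STORE": str.strip, "STATE": str.strip}
client = {}
print(apply_send_record(client, names,
                        ["Hardware ", "100000", "Unit 12", "North Carolina"], convert))  # 4
print(apply_send_record(client, names,
                        ["Software ", "90000", ".", "."], convert))                      # 2
print(client["STORE"], client["INCOME"])   # Unit 12 90000
```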

When data is passed between different systems, there is often a requirement for the system to translate characters
between different code pages (for example, from ASCII to EBCDIC). As will be appreciated by those skilled in the art,
this translation capability is needed because different computers represent the same characters in different ways. This
helps to support different computer architectures, and different national languages. The technique of the preferred
embodiment enables a reduction in the amount of data to be transferred, without interfering with other processes such
as the translation from ASCII to EBCDIC which have to be done regardless of whether the data is reduced or not, and
further provides this reduction in a manner which is more efficient in terms of CPU usage, etc at both the transmitter
and receiver end.

Most replication solutions use change record formats and protocols which are specific to the databases that they
support. The preferred embodiment provides a solution which can be employed in heterogeneous environments with
mixed operating systems, databases and networks, in addition to being applicable to specific environments.

From the above description, it can be seen that the preferred embodiment of the present invention provides a
technique for efficiently compressing structured data, that is, data that is normally represented as a sequence of data
records broken into separate fields, and having means to enable the various fields to be identified, such as by being
delimited.
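
As a rough illustration of the technique summarised above, the following Python sketch replaces each field that matches the corresponding field of the immediately preceding record with a token. The token character `"."`, the function name `compress` and the sample records are assumptions for the example; the sketch also assumes every record has the same number of fields and that no genuine data item equals the token.

```python
from typing import Iterable, Iterator, List, Optional

TOKEN = "."  # assumed single-character match token


def compress(records: Iterable[List[str]]) -> Iterator[List[str]]:
    """Yield records with repeated field values replaced by TOKEN."""
    previous: Optional[List[str]] = None
    for record in records:
        if previous is None:
            # The first record has nothing to be compared with; send it as is.
            yield list(record)
        else:
            yield [TOKEN if cur == prev else cur
                   for cur, prev in zip(record, previous)]
        # Keep the uncompressed record for comparison with the next one.
        previous = record


rows = [
    ["Winchester", "Hampshire", "1000", "Food"],
    ["Winchester", "Hampshire", "1500", "Food"],
    ["Romsey", "Hampshire", "1500", "Toys"],
]
compressed = list(compress(rows))
```

The comparison is always made against the uncompressed preceding record, so a value that repeats over many consecutive records is tokenised in every one of them after the first.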

In the preferred embodiment, the technique described enables the amount of data to be reduced in a way that is
independent of factors such as the machine/database type, character sets, network protocol, etc. The technique can
be integrated with the database system and the query optimiser to produce an efficient means of data compression
which is based on information about the structure of the data, this being available to the database system.

By taking the structure of the data into account, and by realising the recurrent nature of certain data, the technique
can be viewed as actually removing data from the data sequence, as opposed to compressing the data that is there.
The resulting smaller data sequence could then still be subjected to other data compression techniques, for example
the compression technique described in US Patent No 4,701,745, to thereby yield further improvements.
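
The layering described above might look like the following Python sketch. Here `zlib` is used purely as a stand-in for a general-purpose compressor; it is not claimed to be the technique of the cited patent. The token `"."`, the comma/newline serialisation and the sample data are likewise assumptions for illustration.

```python
import zlib
from typing import List

TOKEN = "."  # assumed match token


def token_compress(rows: List[List[str]]) -> List[List[str]]:
    """Replace fields equal to the same field of the preceding record by TOKEN."""
    out = [list(rows[0])] if rows else []
    for prev, cur in zip(rows, rows[1:]):
        out.append([TOKEN if c == p else c for c, p in zip(cur, prev)])
    return out


def serialise(rows: List[List[str]]) -> bytes:
    """Hypothetical wire format: comma-delimited fields, one record per line."""
    return "\n".join(",".join(r) for r in rows).encode("ascii")


# Fifty records in which only the INCOME field varies.
rows = [["Winchester", "Hampshire", str(1000 + i), "Food"] for i in range(50)]

raw = serialise(rows)                      # no reduction
reduced = serialise(token_compress(rows))  # field-level token replacement
packed = zlib.compress(reduced)            # conventional compression on top
```

The token step removes the repeated field values before the general-purpose compressor runs, and the receiver simply reverses the two stages in the opposite order.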

Hence, in summary, the technique of the preferred embodiment is a very quick algorithm, taking very little resource.
It does not prevent the use of more traditional compression techniques, and it is extremely simple. Further, it recognises
the field structure of the data and uses this as a method to achieve potentially excellent compression. Its algorithm is
not affected by the host computer architecture nor that of the transport layers. Additionally, it can reduce the CPU
resources required at the client side, thereby improving performance above and beyond the data reductions.
Claims

1. A method of compressing a data sequence comprising a plurality of records, each record having a plurality of fields
and each field being arranged to contain a data item, the method comprising the steps of:

(a) for a current field within a current record other than the first record in the data sequence:

(i) comparing (250) the data item in the current field with the data item in the corresponding field of a
preceding record;

(ii) if the data item matches the data item in the corresponding field, replacing (270) the current field data
item by a token indicating the match; and

(b) repeating step (a) for a predetermined number of fields in a plurality of records of the data sequence.

2. A method as claimed in Claim 1, wherein said step (a) is repeated for said predetermined number of fields in every
record of the data sequence.

3. A method as claimed in Claim 1 or Claim 2, wherein the predetermined number of fields is every field in the current
record. 

4. A method as claimed in any of claims 1 to 3, wherein, at step (a) (i), the data item in the current field is compared
with the data item in the corresponding field of the immediately preceding record, and the method further comprises, 
subsequent to performing the comparison step (a) for the predetermined number of fields in the current record, 
the step (290) of storing the uncompressed current record as the immediately preceding record for use in the 
performance of said step (a) (i) on the predetermined number of fields of a subsequent record. 

5. A method as claimed in any preceding claim, wherein steps (a) and (b) are performed by a processor of a server
computer (30), the method comprising the further step of sending the data records of the data sequence as
compressed by steps (a) and (b) over a network to a client computer (10).

6. A method as claimed in any preceding claim, wherein the token is a predetermined single character.

7. A method as claimed in Claim 6, wherein the token is the 7 character. 

8. A method of decompressing a data sequence compressed according to the method as claimed in any preceding 
claim, comprising the steps of: 

(A) for each field within a current record other than the first record in the data sequence:

determining whether the field contains the token;

if the field does contain the token, replacing the token by the data item in the corresponding field of a
preceding record; and

(B) repeating step (A) for a plurality of records of the data sequence.

9. A system for compressing a data sequence comprising a plurality of records, each record having a plurality of 
fields and each field being arranged to contain a data item, the system comprising: 

comparison means (50) which, for a current field within a current record other than the first record in the data 
sequence, is arranged to compare the data item in the current field with the data item in the corresponding 
field of a preceding record; and 

match indication means (70), responsive to a signal from the comparison means (60) indicating that the data 
item matches the data item in the corresponding field, for replacing the current field data item by a token 
indicating the match; 
the comparison means (50) being arranged to repetitively perform the comparison process on a predetermined
number of fields in a plurality of records of the data sequence.

10. A system as claimed in Claim 9, wherein the comparison means (60) is arranged to perform the comparison process
on the predetermined number of fields in every record of the data sequence.

11. A system as claimed in Claim 9 or Claim 10, wherein the predetermined number of fields is every field in the current 
record. 

12. A system as claimed in any of claims 9 to 11, wherein the comparison means (50) is arranged to compare the data
item in the current field with the data item in the corresponding field of the immediately preceding record, and the
system further comprises storing means operable, subsequent to the comparison means having performed the
comparison for the predetermined number of fields in the current record, to store (290) the uncompressed current
record as the immediately preceding record for use by the comparison means (60) when performing the comparison
step on the predetermined number of fields of a subsequent record.

13. A system as claimed in any of claims 9 to 12, wherein the system is incorporated within a server computer (30),
the server being arranged to output the data records of the data sequence as compressed by the system for transfer
over a network to a client computer (10).



14. A system as claimed in any of claims 9 to 13, wherein the token is a predetermined single character. 